####################################################################
#
#
# File: tknztbld.def
#
# Personal Library Software, July, 1993
# Tom Donaldson
#
# Tokenizer definitional data for table driven tokenizer:
# CplTabledRomanceTokenizer.
#
# The CplTabledRomanceTokenizer allows customization of tokenization by
# editing rules that define the operation of the tokenizer. The central
# concept is "word continuation" rules, which define the kinds of
# characters that CANNOT be split apart from each other.
#
# History
# -------
#
# 29jul1993 tomd Performance improvements. Got rid of unused
# character classes. Defined all non-token chars
# to be "break" character class. Ordered
# user-defined character classes by expected
# frequency in text.
#
# Also added performance hints throughout.
#
# 26aug93 tomd No longer need to map all chars to char-classes.
# Unmapped ones will default to break-chars.
#
# No longer need to assign numeric values to character
# classes. No longer need to assign names to predefined
# character classes (and cannot).
#
# Canonizer map no longer has to contain all
# characters.
#
####################################################################
####################################################################
#
# Installation
# ============
#
# Database.def File
# -----------------
#
# To use the CplTabledRomanceTokenizer, you need this line in the .def
# file for the database:
#
# TOKENIZER = CplTabledRomanceTokenizer
#
#
# Tokenizer File
# --------------
#
# This file, tknztbld.def, is the rule file. Note that the name of the
# file CANNOT be changed, and the file MUST be in the "home directory"
# of the database using the tokenizer, or the "system" directory for
# the CPL installation.
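#
# For example (the directory names here are illustrative only, not part
# of any actual installation), a database whose .def file is
#
#     C:\PLS\RUSKIN\RUSKIN.DEF
#
# would look for its tokenizer rules in
#
#     C:\PLS\RUSKIN\TKNZTBLD.DEF
#
# and, failing that, in TKNZTBLD.DEF in the CPL system directory.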
#
####################################################################
####################################################################
#
# Operational Overview
# ====================
#
# Database Open
# -------------
#
# When a database is opened, its .def file is read. In the .def file,
# a non-default tokenizer may be specified via a line of the form
#
# TOKENIZER = aTokenizerName
#
# If aTokenizerName is "CplTabledRomanceTokenizer", as soon as the
# tokenizer is needed, it will try to read its definition file (i.e.,
# this tknztbld.def file) from the same directory as the database's .def
# file.
#
# If a problem arises during the load, tokenizer creation will fail. In
# this case, please do a "diff" between the file that failed to load and
# the original copy. Regrettably, few diagnostic error messages are
# currently available to help determine why the tokenizer definition
# rules did not load.
#
# During Tokenization
# -------------------
#
# As a buffer is scanned for words, each character is converted to a
# "character class" via the Character Classification Map. The character
# classes are compared against the tokenization rules defined in this
# file. If the rules explicitly state that the current character may
# not be separated from the preceding one, then the current character is
# treated as part of a word, and the next character is classified and
# tested. This process continues until the scanner finds a character
# whose class the rules do not require to be kept with the class of the
# just-preceding character.
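#
# As an illustrative trace (using the classes and rules actually defined
# later in this file, where digits are mapped to the Letter class),
# scanning the buffer "C3-D4" proceeds as follows:
#
#     'C' -> Letter    start a token
#     '3' -> Letter    "Letter *" forbids a split; keep scanning
#     '-' -> Break     no rule keeps a Break with a Letter; emit "C3"
#     'D' -> Letter    start a new token
#     '4' -> Letter    "Letter *" again; keep scanning
#     (end of buffer)  emit "D4"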
#
#
# Performance Hints
# =================
#
# The table driven tokenizer allows a great deal of flexibility in
# scanning words. It is possible to create a tokenizer definition that
# will scan complex patterns as tokens, or not, depending upon the
# immediate context of the characters being scanned. However, this
# flexibility comes at a price: performance.
#
# In general, the fewer and simpler the character classes and rules, the
# faster tokenization will be. That is, the closer the rules come to
# generating tokens that would pass the simple isalnum() test, the
# faster tokenization will be. The further your needs depart from this
# simplicity, the longer it will take to index files.
#
#
####################################################################
###################################################################
#
#
# Required File Layout
# ====================
#
# This tokenizer definition file, tknztbld.def, must contain the
# following sections in this order:
#
# Section 1: Character Class Definitions
#
# Section 2: Character Classification Map
#
# Section 3: Word Continuation Rules
#
# Section 4: Canonization Map
#
# Section 1, the "Character Class Definitions", gives meaningful names
# to "kinds" of characters. These class names are used in defining
# what kinds of characters make up words.
#
# Section 2, the "Character Classification Map", assigns a character
# class to each character in the character set used by documents.
#
# Section 3, the "Word Continuation Rules", uses the character class
# names defined in the Character Class Definitions to specify what
# groupings of characters may not be separated from each other during
# tokenization.
#
# Section 4, the "Canonization Map", specifies translations of
# characters from their raw input form to their final "canonical"
# indexed form.
#
# The detailed structure of each section of the tokenizer definition is
# covered in comments accompanying the sections, below.
#
# The lines in this file that are preceded by the pound character, '#',
# are comment lines (obviously). Comments may only appear on a line by
# themselves, but comment lines may appear anywhere in the file.
# Likewise, blank lines may appear anywhere in the file.
#
####################################################################
####################################################################
#
# Section 1: Character Class Definitions
#
####################################################################
#
# The Character Class Definitions give names to the types of
# characters that can take part in words, and that delimit words. These
# names will be used in Section 2, the Character Classification Map, to
# assign character classes to individual characters.
#
# You may define up to 250 character classes, although fewer than 10
# will most likely be enough. Every character that can take part in a
# token MUST be assigned one of the defined classes; any character left
# unassigned defaults to the Break class. The mapping is done via the
# Character Classification Map, which appears in Section 2, below.
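#
# For example, a database that needed to keep hyphenated terms whole
# might define its own class for the hyphen (the name "Hyphen" is
# hypothetical, and is not used elsewhere in this file):
#
#     Letter
#     Number
#     Hyphen
#     EndRule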
#
#
# Predefined Special Values
# =========================
#
# There are four predefined values that are special to the tokenizer:
#
# Invalid - NO character should EVER be of this class.
#
# EndRule - A character class used in this definition file to mark the
# end of your character classification names. It must be the
# last character class name listed in the table below, and may
# appear only in that position.
#
# Break --- Characters of the "Break" class can NEVER be part of a
# token. It is always valid for the tokenizer to
# word-break the data stream either before or after a Break
# character.
#
# EndBuff - Characters of the EndBuff class will be treated as a
# "null" terminating character when scanning data for
# tokens. The ASCII NUL character is an EndBuff character
# by default, and you will not normally map other
# characters to this class.
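#
# For example, to state explicitly that the TAB character always
# delimits words, you could map it to the Break class (decimal 9 is
# ASCII TAB; this is purely illustrative, since unmapped characters
# default to Break anyway):
#
#     9 Break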
#
#
# Only predefined Character Class names, or Character Class names that
# are defined in this Character Class Definition section may be used
# anywhere else in the definition. Definitions are case sensitive.
#
#
# Performance Hints
# =================
#
# 1) Use The "Break" Class As Much As Possible. The Break class is a
# special character class. Minimal testing is done on characters
# classified as Break. This is because it is ALWAYS valid to separate a
# Break character from any other character. Any characters that will
# NEVER be part of a token should be classified as Break.
#
# 2) Define Class Names By Frequency. When creating user-defined character
# classes, list first the classes that will be assigned to the largest
# numbers of characters in the data. For example, you would expect most
# data in a text file to be classified as "letter" characters, so you
# should define your "letter" class name first.
#
# 3) Define As Few Classes As Possible. Fewer classes means less
# testing. Less testing means faster tokenization.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Default Character Class Definitions: Names
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
#
# Name
# ----
Letter
Number
Underscore
Z_Special
#
# The EndRule character class must always be the last class name listed,
# and must only appear at the end of your character class definitions.
# If it is not at the end of char class defs, an error occurs as soon as
# the loader hits a non-blank, non-comment line that is not a character
# class definition.
#
EndRule
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
####################################################################
#
# Section 2: Character Classification Map
#
####################################################################
#
# Maps characters to word-continuation character classes.
#
# The Character Classification Map associates a character code with a
# character class. The character classes must have been defined in the
# Character Class Definition in Section 1. Only character class names
# that have been defined in the Character Class Definition (Section 1,
# above) may appear in this Character Classification Map (Section 2) or
# in the Word Continuation Rules (Section 3, below).
#
# Default Mapping
# ===============
#
# You only need to provide character classification for the characters
# that you want to appear in tokens.
#
# Any characters that you do NOT map will be classified in this way:
# - ASCII NUL is mapped as the end-of-buffer marker.
# - All other characters are mapped as break characters.
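#
# For example, the hyphen '-' (decimal 45) is not mapped below, so it is
# a Break character, and the input "self-evident" tokenizes as "self"
# and "evident". To keep such words whole you would have to map 45 to a
# class of its own and add continuation rules for that class.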
#
#
# End Of Map Marker
# =================
#
# As with the previous table, there is a special value for this Character
# Classification Map that marks its end. The special value is -1. The
# decimal character code -1 will cause the Character Classification Map
# loader to stop reading.
#
#
# Performance Hints
# =================
#
# Leave as many characters classified as "break" characters as possible.
# Classify as many characters to the same class as possible.
#
# The following sample table uses the Z_Special and Underscore classes
# so that identifiers beginning with "Z_" can be indexed for specialized
# technical documentation. If you don't need such identifiers in
# indexable terms, map 'Z' simply as a Letter, map '_' to Break, and
# remove all references to Z_Special and Underscore in the Character
# Class Definitions and the Word Continuation Rules. Your database will
# index faster.
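#
# As a sketch of that edit (not a tested configuration), you would:
#
#     - map decimal 90 ('Z') as a plain Letter in Section 2
#     - delete the mapping of decimal 95 ('_')
#     - delete Underscore and Z_Special from Section 1
#     - delete the three Z_Special rules from Section 3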
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Character Classification Map
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# ------- ----- -----------------------
# Decimal Class
# Value Name Comment
# ------- ----- -----------------------
# Digits:
48 Letter # Char '0'
49 Letter # Char '1'
50 Letter # Char '2'
51 Letter # Char '3'
52 Letter # Char '4'
53 Letter # Char '5'
54 Letter # Char '6'
55 Letter # Char '7'
56 Letter # Char '8'
57 Letter # Char '9'
# Upper case letters:
65 Letter # Char 'A'
66 Letter # Char 'B'
67 Letter # Char 'C'
68 Letter # Char 'D'
69 Letter # Char 'E'
70 Letter # Char 'F'
71 Letter # Char 'G'
72 Letter # Char 'H'
73 Letter # Char 'I'
74 Letter # Char 'J'
75 Letter # Char 'K'
76 Letter # Char 'L'
77 Letter # Char 'M'
78 Letter # Char 'N'
79 Letter # Char 'O'
80 Letter # Char 'P'
81 Letter # Char 'Q'
82 Letter # Char 'R'
83 Letter # Char 'S'
84 Letter # Char 'T'
85 Letter # Char 'U'
86 Letter # Char 'V'
87 Letter # Char 'W'
88 Letter # Char 'X'
89 Letter # Char 'Y'
90 Letter # Char 'Z'
# Lower case letters:
97 Letter # Char 'a'
98 Letter # Char 'b'
99 Letter # Char 'c'
100 Letter # Char 'd'
101 Letter # Char 'e'
102 Letter # Char 'f'
103 Letter # Char 'g'
104 Letter # Char 'h'
105 Letter # Char 'i'
106 Letter # Char 'j'
107 Letter # Char 'k'
108 Letter # Char 'l'
109 Letter # Char 'm'
110 Letter # Char 'n'
111 Letter # Char 'o'
112 Letter # Char 'p'
113 Letter # Char 'q'
114 Letter # Char 'r'
115 Letter # Char 's'
116 Letter # Char 't'
117 Letter # Char 'u'
118 Letter # Char 'v'
119 Letter # Char 'w'
120 Letter # Char 'x'
121 Letter # Char 'y'
122 Letter # Char 'z'
# Extended characters, indexed as ordinary letters. The character names
# below assume the ISO Latin-1 code page, as does the canonization
# example in Section 4:
156 Letter # Extended character (code page dependent)
161 Letter # Char '¡'
192 Letter # Char 'À'
193 Letter # Char 'Á'
194 Letter # Char 'Â'
195 Letter # Char 'Ã'
196 Letter # Char 'Ä'
197 Letter # Char 'Å'
198 Letter # Char 'Æ'
199 Letter # Char 'Ç'
200 Letter # Char 'È'
201 Letter # Char 'É'
202 Letter # Char 'Ê'
203 Letter # Char 'Ë'
204 Letter # Char 'Ì'
205 Letter # Char 'Í'
206 Letter # Char 'Î'
207 Letter # Char 'Ï'
208 Letter # Char 'Ð'
209 Letter # Char 'Ñ'
210 Letter # Char 'Ò'
211 Letter # Char 'Ó'
212 Letter # Char 'Ô'
213 Letter # Char 'Õ'
214 Letter # Char 'Ö'
215 Letter # Char '×' (multiplication sign)
216 Letter # Char 'Ø'
217 Letter # Char 'Ù'
218 Letter # Char 'Ú'
219 Letter # Char 'Û'
220 Letter # Char 'Ü'
221 Letter # Char 'Ý'
222 Letter # Char 'Þ'
223 Letter # Char 'ß'
224 Letter # Char 'à'
225 Letter # Char 'á'
226 Letter # Char 'â'
227 Letter # Char 'ã'
228 Letter # Char 'ä'
229 Letter # Char 'å'
230 Letter # Char 'æ'
231 Letter # Char 'ç'
232 Letter # Char 'è'
233 Letter # Char 'é'
234 Letter # Char 'ê'
235 Letter # Char 'ë'
236 Letter # Char 'ì'
237 Letter # Char 'í'
238 Letter # Char 'î'
239 Letter # Char 'ï'
240 Letter # Char 'ð'
241 Letter # Char 'ñ'
242 Letter # Char 'ò'
243 Letter # Char 'ó'
244 Letter # Char 'ô'
245 Letter # Char 'õ'
246 Letter # Char 'ö'
248 Letter # Char 'ø'
249 Letter # Char 'ù'
250 Letter # Char 'ú'
251 Letter # Char 'û'
252 Letter # Char 'ü'
253 Letter # Char 'ý'
254 Letter # Char 'þ'
255 Letter # Char 'ÿ'
# Special characters:
90 Z_Special # Char 'Z'
95 Underscore # Char '_'
# --- ----- -----------------------
-1 EndOfDefs # Not loaded. Just marks end of map definition.
# --- ----- -----------------------
####################################################################
#
# Section 3: Word Continuation Rules
#
####################################################################
#
# The word continuation rules specify which sequences of characters
# CANNOT be separated from each other in breaking a stream of data into
# words.
#
# Each rule consists of character class names separated by spaces or
# tabs. A rule says that characters of the specified classes may not be
# split. For example, the rule:
#
# Letter Letter
#
# says that when two data characters are of class Letter, and occur side
# by side, the data characters may not be separated.
#
# Similarly, the rule:
#
# Letter Number
#
# says that a character that classifies as a Letter may not be separated
# from a following character that classifies as a Number.
#
# Example 1:
#
# How does the tokenizer decide whether a character is a Letter or a
# Number? That association is formed by the Character Classification
# Map, in Section 2. Using the Character Classification Map in this
# file, and the two "Letter Letter" and "Letter Number" rules just
# presented, the following input text:
#
# "A-1 B 2 C3 4D 5 E 6-F GH IJ7LMNOP"
#
# Will tokenize as:
#
# "A" "B" "C3" "D" "E" "F" "GH" "IJ7" "LMNOP"
#
# Because: A Letter may not be separated from a following Letter, and a
# Letter may not be separated from a following Number. However, a
# Number may be separated from a following Letter, and all other
# characters are considered as delimiters. Obviously, we need more
# rules. A more complete sample set follows.
#
# Character Class Names, and '*'
# ------------------------------
#
# All character class names in each rule MUST have been defined in the
# Character Class Definitions in Section 1, above, with the exception of
# one special "name": the '*' character. The '*' character means that
# characters of the preceding class may occur one or more times in a
# sequence.
#
# Note that in the previous rule we said that no two letters could be
# separated. We did this with the rule:
#
# Letter Letter
#
# But what if a single letter, such as the "A" in "Section A", occurs by
# itself? The "Letter Letter" rule does NOT say that "A" is a valid
# token. The following two rules together DO say that a single letter,
# or any number of letters in a row, must be treated as a token:
#
# Letter
# Letter Letter
#
# However, we can reduce this to a single rule using the special
# character class name "*":
#
# Letter *
#
#
#
# Example 2: "Unusual" Characters In Tokens
#
# This rule:
#
# Dollar * Letter
#
# says that a stream of Dollar characters may not be broken if it is
# followed by a Letter. This rule will cause these strings to be
# treated as words:
#
# "$$SysDevice"
# "$Fr$ed"
# "$x8"
#
# But the same "Dollar * Letter" rule will not accept these strings:
#
# "SysDevice$$$" -- Token will be "SysDevice", "$$$" is junk.
# "Fr$ed" -- Tokens will be "Fr" and "$ed".
# "x$8" -- Token will be "x", the "$" and "8" will be
# discarded.
#
#
#
# Example 3: More Complex Rules
#
# Using the example rules to this point, the string:
#
# "tomd@pls.com"
#
# Will be tokenized as:
#
# "tomd" "pls" "com"
#
# To cause tomd@pls.com to be accepted as a token, we can define this
# rule:
#
# Letter * AtSign Letter * Dot Letter *
#
# Or define these equivalent rules:
#
# Letter *
# Letter AtSign
# AtSign Letter
# Letter Dot Letter
#
#
#
# Implicit Linking of Rules
# -------------------------
#
# It is important to note that rules functionally link to each other.
# For example, we used these two rules in the previous example:
#
# Letter AtSign
# AtSign Letter
#
# That is, a Letter may not be separated from a following AtSign, and an
# AtSign may not be separated from a following Letter, which
# functionally has the same effect as:
#
# Letter AtSign Letter
#
# Thus, the last character class of a rule can match up with the same
# character class at the head of another rule (or the same rule) to
# match longer strings. In fact, the tokenizer does this, internally,
# to create the longest tokens it can from an input stream.
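#
# For example (using only classes defined in this file), the two rules
#
#     Z_Special Underscore
#     Underscore Letter *
#
# would link through the shared Underscore class to keep a string such
# as "Z_ABC" whole, just as the single combined rule
#
#     Z_Special Underscore Letter *
#
# (which actually appears in Section 3, below) does.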
#
#
# EndRule: End of Definitions Marker
# ----------------------------------
#
# The last "rule" in the list of rules MUST consist of solely the
# EndRule character class (i.e., the rule with Value 1). It tells the
# Word Continuation Rule loader that it is finished. If the EndRule is
# missing, the Word Continuation Rule loader will try to eat any
# following data as continuation rules, and will fail.
#
# Default Word Continuation Rules
# -------------------------------
#
# The following, short, rule set will create tokens consisting of
# unbroken runs of Letter characters or of Number characters, plus
# tokens that begin with "Z_" (a Z_Special character followed by an
# Underscore). It will NOT create tokens consisting ONLY of the "Z_"
# prefix; the rules state that for the prefix to be part of a token, it
# must be followed by a Letter. Note that these rules were tailored to
# a particular database, and will probably not exactly fit your needs.
#
#
# Examples Using Default Word Continuation Rules
# ----------------------------------------------
#
# Example 1':
#
# The string from Example 1:
#
# "A-1 B 2 C3 4D 5 E 6-F GH IJ7LMNOP"
#
# Previously tokenized as:
#
# "A" "B" "C3" "D" "E" "F" "GH" "IJ7" "LMNOP"
#
# With the default rules, below, tokenizes as:
#
# "A" "1" "B" "2" "C3" "4D" "5" "E" "6" "F" "GH" "IJ7MNOP"
#
#
# Example 2':
#
# The strings from Example 2:
#
# "$$SysDevice"
# "$Fr$ed"
# "$x8"
#
# were treated as whole words by the "Dollar * Letter" rule set of that
# example. The default rules, below, define no Dollar class, so '$' is
# an unmapped Break character, and the strings tokenize as:
#
# "SysDevice" "Fr" "ed" "x8"
#
#
#
# Example 3':
#
# The string from Example 3:
#
# "tomd@pls.com"
#
# Will still be tokenized as (lacking the AtSign and Dot rules):
#
# "tomd" "pls" "com"
#
#
#
#
# Performance Hints
# =================
#
#
# 1) Keep Rules As Simple As Possible. The simpler the rule, the faster
# the word-continuation tests. If a rule uses few character classes,
# and contains few items, it is "simple." If a rule uses more than two
# character classes, or uses two or more classes in different
# permutations, the rule is "complex."
#
# 2) Define As Few Rules As Possible. The fewer rules there are to
# check, the faster tokenization will be.
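#
# For example, of the rules that appear below,
#
#     Letter *
#
# is about as "simple" as a rule can be, while
#
#     Z_Special Underscore Z_Special * Letter *
#
# mixes three classes in several positions and is comparatively
# "complex". Prefer the former style whenever your data allows.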
#
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Word Continuation Rules
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# I apologize if this explanation has been overly long, especially given
# the brevity of the following rules. This documentation is partially
# for me (Tom Donaldson), and partially for anyone using the tokenizer.
# If I jot it down now, there is a better chance I will "remember" it
# all later!
#
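# The first three rules keep "Z_" identifiers whole: a Z_Special
# character, an Underscore, then Letters, optionally with further
# Z_Special characters before or after the Letters. The last two rules
# keep unbroken runs of Letters or of Numbers together.
#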
Z_Special Underscore Letter *
Z_Special Underscore Z_Special * Letter *
Z_Special Underscore Letter * Z_Special *
Letter *
Number *
#
# EndRule MUST be the last rule in the continuation rules.
# It must NOT occur as part of any other rule.
#
EndRule
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
####################################################################
#
# Section 4: Canonization Map
#
####################################################################
#
# After the tokenizer returns a "word", based on the rules defined in
# the first three sections of this file, it must be put into a
# regularized, "canonical", form to make searches easier and faster. In
# fact, canonizing tokens can also drastically reduce the size of the
# database's dictionary.
#
# A common form of canonization is to convert all lower case letters to
# upper case, and use the upper cased terms in the index and during
# searches. This allows, for example, the word "NeXT" to match the
# words "next", "NEXT", "nExt", etc. Note that this also means that
# only one version of "next" is stored in the index, rather than all
# permutations on the case of the letters that might exist in the
# database.
#
# The canonization map lets you determine which character-by-character
# transforms are performed during canonization. The default
# canonization supplied in the following table maps all lower case
# characters to upper case. All other values are mapped to themselves;
# that is, all other values are unchanged after canonization.
#
# For example, in some databases you might want to convert the "A WITH
# TILDE" to a plain "A". You can do this by specifying that the "A
# WITH TILDE" character, with character code 195, should be canonized
# to the character "A", with character code 65:
#
# 195 65 # Canonize A-tilde as A
#
#
# Default Values
# ==============
#
# You do not have to define a mapping for every character. By default,
# each character maps to itself. Thus, your canonization map need only
# contain the characters that you want translated after tokenization.
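#
# For example, the map below contains only the codes 97 through 122
# (the lower case letters), so a digit such as '3' (decimal 51) passes
# through canonization unchanged.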
#
# CRITICAL
# ========
#
# As with the previous tables, there is a special value for this
# Canonization Map that marks its end. The special value is -1. The
# decimal character code -1 will cause the Canonization Map loader to
# stop reading.
#
#
# Performance Hints
# =================
#
# None.
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# Default Canonization Map
#
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#
# ------- ------- -----------
# Input Output
# Decimal Decimal
# Char Char
# Value Value Comment
# ------- ------- -----------
#
# Map the characters a-z to the "canonical" characters A-Z. That is,
# all letters will be upper cased.
97 65 # Char 'a' canonizes to 'A'
98 66 # Char 'b' canonizes to 'B'
99 67 # Char 'c' canonizes to 'C'
100 68 # Char 'd' canonizes to 'D'
101 69 # Char 'e' canonizes to 'E'
102 70 # Char 'f' canonizes to 'F'
103 71 # Char 'g' canonizes to 'G'
104 72 # Char 'h' canonizes to 'H'
105 73 # Char 'i' canonizes to 'I'
106 74 # Char 'j' canonizes to 'J'
107 75 # Char 'k' canonizes to 'K'
108 76 # Char 'l' canonizes to 'L'
109 77 # Char 'm' canonizes to 'M'
110 78 # Char 'n' canonizes to 'N'
111 79 # Char 'o' canonizes to 'O'
112 80 # Char 'p' canonizes to 'P'
113 81 # Char 'q' canonizes to 'Q'
114 82 # Char 'r' canonizes to 'R'
115 83 # Char 's' canonizes to 'S'
116 84 # Char 't' canonizes to 'T'
117 85 # Char 'u' canonizes to 'U'
118 86 # Char 'v' canonizes to 'V'
119 87 # Char 'w' canonizes to 'W'
120 88 # Char 'x' canonizes to 'X'
121 89 # Char 'y' canonizes to 'Y'
122 90 # Char 'z' canonizes to 'Z'
# --- ----- -----------------------
-1 -1 # Not loaded. Just marks end of map definition.
# --- ----- -----------------------
####################################################################
#
#
# End Of File: tknztbld.def
#
#
####################################################################